Take-home_Ex1

1 Overview

As urban infrastructures become increasingly digitized, data from sources like buses, taxis, and public utilities offer valuable insights into movement patterns over time and space. The widespread use of technologies like GPS and RFID in vehicles has generated massive datasets, including route and ridership data. Analyzing these datasets can reveal important patterns and characteristics of human movement in a city, benefiting urban management and transport service providers.

This study aims to leverage Exploratory Spatial Data Analysis (ESDA), specifically Local Indicators of Spatial Association (LISA) and Emerging Hot Spot Analysis (EHSA), to uncover spatial and spatio-temporal mobility patterns among public bus passengers in Singapore.

1.1 Packages Used

The R packages used for the analysis are as follows:

  • sf: Imports, manages, and processes spatial vector data using the simple features standard.

  • sfdep: Analyzes and models spatial dependencies in data (spatial weights, LISA, EHSA).

  • tmap: Creates thematic maps for visualizing geospatial data.

  • tidyverse: A collection of R packages with a unified approach for data manipulation and visualization.

  • plotly: Enables interactive and dynamic data visualizations.

  • zoo: Handles and analyzes time series data.

  • Kendall: Computes Kendall’s tau rank correlation coefficient for assessing rank-based associations.

  • kableExtra: Enhances ‘knitr’ package’s ‘kable()’ function for styling HTML and LaTeX tables in R Markdown. It offers advanced formatting options like row/column customization, conditional styling, and captioning, elevating tables to publication quality.

  • ggrain: An R package for creating raincloud plots following the ‘Grammar of Graphics’ (i.e., ggplot2).

  • DT: Creates interactive HTML tables.

Code
pacman::p_load(sf, sfdep, tmap, tidyverse, plotly, zoo, Kendall, kableExtra, ggrain, DT)

1.2 Data Import

1.2.1 Aspatial data - Origin_destination_bus data

Code
# Load each csv file into R separately
bus08 <- read_csv("data/aspatial/origin_destination_bus_202308.csv")
bus09 <- read_csv("data/aspatial/origin_destination_bus_202309.csv")
bus10 <- read_csv("data/aspatial/origin_destination_bus_202310.csv")

# Combine all rows into single dataframe
origind <- rbind(bus08, bus09, bus10)

str(origind)
spc_tbl_ [17,118,005 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ YEAR_MONTH         : chr [1:17118005] "2023-08" "2023-08" "2023-08" "2023-08" ...
 $ DAY_TYPE           : chr [1:17118005] "WEEKDAY" "WEEKENDS/HOLIDAY" "WEEKENDS/HOLIDAY" "WEEKDAY" ...
 $ TIME_PER_HOUR      : num [1:17118005] 16 16 14 14 17 17 17 17 7 17 ...
 $ PT_TYPE            : chr [1:17118005] "BUS" "BUS" "BUS" "BUS" ...
 $ ORIGIN_PT_CODE     : chr [1:17118005] "04168" "04168" "80119" "80119" ...
 $ DESTINATION_PT_CODE: chr [1:17118005] "10051" "10051" "90079" "90079" ...
 $ TOTAL_TRIPS        : num [1:17118005] 7 2 3 10 5 4 3 22 3 3 ...
 - attr(*, "spec")=
  .. cols(
  ..   YEAR_MONTH = col_character(),
  ..   DAY_TYPE = col_character(),
  ..   TIME_PER_HOUR = col_double(),
  ..   PT_TYPE = col_character(),
  ..   ORIGIN_PT_CODE = col_character(),
  ..   DESTINATION_PT_CODE = col_character(),
  ..   TOTAL_TRIPS = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 
Code
head(origind,10) %>%
  kbl() %>%
  kable_styling(
    full_width = F, 
    bootstrap_options = c("condensed", "responsive"))
YEAR_MONTH DAY_TYPE TIME_PER_HOUR PT_TYPE ORIGIN_PT_CODE DESTINATION_PT_CODE TOTAL_TRIPS
2023-08 WEEKDAY 16 BUS 04168 10051 7
2023-08 WEEKENDS/HOLIDAY 16 BUS 04168 10051 2
2023-08 WEEKENDS/HOLIDAY 14 BUS 80119 90079 3
2023-08 WEEKDAY 14 BUS 80119 90079 10
2023-08 WEEKENDS/HOLIDAY 17 BUS 44069 17229 5
2023-08 WEEKDAY 17 BUS 44069 17229 4
2023-08 WEEKENDS/HOLIDAY 17 BUS 20281 20141 3
2023-08 WEEKDAY 17 BUS 20281 20141 22
2023-08 WEEKDAY 7 BUS 19051 10017 3
2023-08 WEEKENDS/HOLIDAY 17 BUS 11169 04219 3
  • YEAR_MONTH: Data collection month in the format of YYYY-MM.

  • DAY_TYPE: Weekday or Weekends/Holiday.

  • TIME_PER_HOUR: Hour of the day in 24 hour format.

  • PT_TYPE: Type of public transportation.

  • ORIGIN_PT_CODE: Identifier for the bus stop where the trip originated.

  • DESTINATION_PT_CODE: Identifier for the bus stop where the trip ended.

  • TOTAL_TRIPS: Total number of trips recorded for each origin-destination pair.

Code
summary(origind)
  YEAR_MONTH          DAY_TYPE         TIME_PER_HOUR     PT_TYPE         
 Length:17118005    Length:17118005    Min.   : 0.00   Length:17118005   
 Class :character   Class :character   1st Qu.:10.00   Class :character  
 Mode  :character   Mode  :character   Median :14.00   Mode  :character  
                                       Mean   :14.06                     
                                       3rd Qu.:18.00                     
                                       Max.   :23.00                     
 ORIGIN_PT_CODE     DESTINATION_PT_CODE  TOTAL_TRIPS      
 Length:17118005    Length:17118005     Min.   :    1.00  
 Class :character   Class :character    1st Qu.:    2.00  
 Mode  :character   Mode  :character    Median :    4.00  
                                        Mean   :   20.46  
                                        3rd Qu.:   12.00  
                                        Max.   :36668.00  
Code
# Count unique values in ORIGIN_PT_CODE
unique_origin_count <- n_distinct(origind$ORIGIN_PT_CODE)

# Count unique values in DESTINATION_PT_CODE
unique_destination_count <- n_distinct(origind$DESTINATION_PT_CODE)

# Print the counts
print(paste("Unique origins:", unique_origin_count))
[1] "Unique origins: 5075"
Code
print(paste("Unique destinations:", unique_destination_count))
[1] "Unique destinations: 5079"
Code
# Calculate the median of TOTAL_TRIPS
median_total_trips <- median(origind$TOTAL_TRIPS)

# Count the number of rows where TOTAL_TRIPS is above the median
rows_above_median <- sum(origind$TOTAL_TRIPS > median_total_trips)

# Print the count
print(rows_above_median)
[1] 7977626
  • Most of the columns are stored with a Class and Mode of “character”

  • Over a three-month period, a total of 17,118,005 combinations of bus trips were recorded.

  • There are 5,075 unique origin bus stops and 5,079 unique destination bus stops

  • On average, there are 20.46 trips for each bus route, but the highest recorded number of trips for a single route is an exceptional 36,668. This significant discrepancy suggests there might be outliers or anomalies in the data, warranting additional investigation.

  • In a dataset of over 17 million records, the low median of 4 trips, the much higher mean of 20.46, and a maximum of 36,668, together with more than 7.9 million records exceeding the median, indicate a right-skewed distribution: many low-volume origin-destination pairs alongside a long tail of high-volume outliers.
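The right skew can also be confirmed numerically. Below is a base-R sketch on a toy vector (illustrative values, not the actual trip data): a mean sitting well above the median, together with a positive quantile-based (Bowley) skewness, both signal a right-skewed distribution.

```r
# Toy right-skewed sample standing in for TOTAL_TRIPS (illustrative only)
trips <- c(rep(1, 500), rep(2, 200), rep(4, 150),
           rep(10, 100), rep(50, 40), rep(500, 10))

# Right skew pulls the mean above the median
mean(trips)
median(trips)

# Bowley (quartile) skewness: positive values indicate right skew
q <- unname(quantile(trips, c(0.25, 0.5, 0.75)))
bowley <- (q[3] - 2 * q[2] + q[1]) / (q[3] - q[1])
bowley
```

The same checks applied to TOTAL_TRIPS would reach the conclusion stated above.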

1.2.2 Geospatial data - bus_stop_location_data

Code
busstop <- st_read(
    dsn = "data/geospatial",
    layer = "BusStop"
  ) %>%
  mutate(
    BUS_STOP_N = as.factor(BUS_STOP_N),
    BUS_ROOF_N = as.factor(BUS_ROOF_N),
    LOC_DESC = as.factor(LOC_DESC)
  )
Reading layer `BusStop' from data source 
  `C:\Zackkoh94\ISSS624\Take-home_Ex1\data\geospatial' using driver `ESRI Shapefile'
Simple feature collection with 5161 features and 3 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 3970.122 ymin: 26482.1 xmax: 48284.56 ymax: 52983.82
Projected CRS: SVY21
Code
glimpse(busstop)
Rows: 5,161
Columns: 4
$ BUS_STOP_N <fct> 22069, 32071, 44331, 96081, 11561, 66191, 23389, 54411, 285…
$ BUS_ROOF_N <fct> B06, B23, B01, B05, B05, B03, B02A, B02, B09, B01, B16, B02…
$ LOC_DESC   <fct> OPP CEVA LOGISTICS, AFT TRACK 13, BLK 239, GRACE INDEPENDEN…
$ geometry   <POINT [m]> POINT (13576.31 32883.65), POINT (13228.59 44206.38),…
Code
st_crs(busstop)
Coordinate Reference System:
  User input: SVY21 
  wkt:
PROJCRS["SVY21",
    BASEGEOGCRS["WGS 84",
        DATUM["World Geodetic System 1984",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]],
            ID["EPSG",6326]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["Degree",0.0174532925199433]]],
    CONVERSION["unnamed",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["Degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["Degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["(E)",east,
            ORDER[1],
            LENGTHUNIT["metre",1,
                ID["EPSG",9001]]],
        AXIS["(N)",north,
            ORDER[2],
            LENGTHUNIT["metre",1,
                ID["EPSG",9001]]]]

The EPSG.io database indicates that the projected coordinate system for Singapore is SVY21 / Singapore TM, associated with the EPSG code 3414. The ‘busstop’ dataset, however, was read with a bare ‘SVY21’ definition that carries no CRS-level EPSG identifier (the EPSG:9001 entries in the WKT above merely identify the metre as the length unit), so the CRS should be set explicitly to EPSG:3414. Note that st_set_crs() only assigns the identifier without transforming coordinates, which is appropriate here because the coordinates are already in SVY21.

Code
# Setting the CRS for the busstop data to EPSG 3414
busstop <- st_set_crs(busstop, 3414) %>%
  # Changing the column name for easier integration with main dataframe
  mutate(
    ORIGIN_PT_CODE = as.factor(BUS_STOP_N)
  ) %>%
  # Keeping only necessary columns for further analysis
  select(
    ORIGIN_PT_CODE, 
    LOC_DESC,
    geometry
  )

# Verifying the CRS assignment for busstop
st_crs(busstop)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]

2 Data Preparation

2.1 Aspatial Data

This project requires computing passenger trips within the following peak-hour time frames.

Peak hour period               Bus tap on time
Weekday morning peak           6am to 9am
Weekday evening peak           5pm to 8pm
Weekend/holiday morning peak   11am to 2pm
Weekend/holiday evening peak   4pm to 7pm
Code
origind <- origind %>%
  mutate(
    ORIGIN_PT_CODE = as.factor(ORIGIN_PT_CODE),
    DESTINATION_PT_CODE = as.factor(DESTINATION_PT_CODE)
  )

As mentioned before, ORIGIN_PT_CODE and DESTINATION_PT_CODE are stored as character. Because they represent bus stop identifiers, they are converted to factors (a categorical data type) for further analysis.

Code
origind_agg <- origind %>%
  # Classify trips into specific time periods based on day type and hour
  mutate(
    period = case_when(
      DAY_TYPE == "WEEKDAY" &
        TIME_PER_HOUR >= 6 & TIME_PER_HOUR <= 9   ~ "Weekday morning peak",
      DAY_TYPE == "WEEKDAY" &
        TIME_PER_HOUR >= 17 & TIME_PER_HOUR <= 20 ~ "Weekday evening peak",
      DAY_TYPE == "WEEKENDS/HOLIDAY" &
        TIME_PER_HOUR >= 11 & TIME_PER_HOUR <= 14 ~ "Weekend/PH morning peak",
      DAY_TYPE == "WEEKENDS/HOLIDAY" &
        TIME_PER_HOUR >= 16 & TIME_PER_HOUR <= 19 ~ "Weekend/PH evening peak",
      TRUE ~ "Others"
    )
  ) %>%
  # Exclude data outside the specified periods for focused analysis
  filter(
    period != "Others"
  ) %>%
  # Aggregate the total number of trips by origin bus stop and month for each classified period
  group_by(
    YEAR_MONTH,
    period,
    ORIGIN_PT_CODE
  ) %>%
  summarise(
    num_trips = sum(TOTAL_TRIPS)
  ) %>%
  ungroup()

We categorize and preprocess the data into the time frames shown above.

Code
origind_agg
# A tibble: 60,168 × 4
   YEAR_MONTH period               ORIGIN_PT_CODE num_trips
   <chr>      <chr>                <fct>              <dbl>
 1 2023-08    Weekday evening peak 01012               8448
 2 2023-08    Weekday evening peak 01013               7328
 3 2023-08    Weekday evening peak 01019               3608
 4 2023-08    Weekday evening peak 01029               9317
 5 2023-08    Weekday evening peak 01039              12937
 6 2023-08    Weekday evening peak 01059               2133
 7 2023-08    Weekday evening peak 01109                322
 8 2023-08    Weekday evening peak 01112              45010
 9 2023-08    Weekday evening peak 01113              27233
10 2023-08    Weekday evening peak 01119               9323
# ℹ 60,158 more rows

2.2 Data Checks and Wrangling

Code
# Function to count duplicated rows in a dataframe
count_duplicate_rows <- function(df, df_name) {
  df %>%
    group_by_all() %>%
    filter(n() > 1) %>%
    ungroup() %>%
    summarise(df_name = df_name, row_count = n())
}

duplicate1 <- count_duplicate_rows(bus08, "bus08")
duplicate2 <- count_duplicate_rows(bus09, "bus09")
duplicate3 <- count_duplicate_rows(bus10, "bus10")

all_duplicates <- bind_rows(duplicate1, duplicate2, duplicate3)

# Output the combined counts
print(all_duplicates)
# A tibble: 3 × 2
  df_name row_count
  <chr>       <int>
1 bus08           0
2 bus09           0
3 bus10           0

There are no duplicate rows, so joins in the latter part of the project will not be adversely affected.

Code
# Function to count the number of rows containing null values
count_null_rows <- function(df, df_name) {
  df %>%
    filter(if_any(everything(), is.na)) %>%
    summarise(df_name = df_name, row_count = n())
}

# Apply the function to each dataframe to count rows with nulls
nulls1 <- count_null_rows(bus08, "bus08")
nulls2 <- count_null_rows(bus09, "bus09")
nulls3 <- count_null_rows(bus10, "bus10")

# Combine the results
all_nulls <- bind_rows(nulls1, nulls2, nulls3)

# Print the counts of rows with nulls
print(all_nulls)
# A tibble: 3 × 2
  df_name row_count
  <chr>       <int>
1 bus08           0
2 bus09           0
3 bus10           0

There are no null values, so joins in the latter part of the project will not be adversely affected.

3 Data Exploration

Understanding daily passenger trip distribution is crucial for urban traffic management and congestion relief, with geospatial analysis providing insights into areas of dense commuter traffic and patterns.

As a start, we will look at the distribution of passenger trips for the different periods in September 2023.

Code
# Generate scatter plots for September 2023 bus passenger data
bus09_scatter <- origind_agg %>%
  filter(YEAR_MONTH == "2023-09") %>%  # Focus on September 2023
  ggplot(aes(x = num_trips, y = period, fill = period, color = period)) +  # Set plot aesthetics
  geom_point(size = 3, alpha = 0.7) +  # Create scatter plot
  scale_x_continuous(  # Configure x-axis scale
    labels = scales::number_format(accuracy = 1), 
    breaks = scales::pretty_breaks(n = 5)) +
  scale_fill_brewer(palette = "Set2") +  # Use a Brewer color palette
  labs(
    title = "September 2023: Peak Passenger Volumes in Weekday Mornings",
    subtitle = "Variability during weekdays suggests potential congestion") + 
  theme(
    axis.title.y = element_blank(),
    axis.title.x = element_blank(),
    axis.ticks.y = element_blank(),
    axis.text.x = element_blank()  # Remove x-axis labels for clarity
  )  # Adjust plot theme for a clean appearance

# Display the plot
bus09_scatter

  • Passenger counts are notably higher during peak weekday hours as opposed to weekend or public holiday peaks.

  • Each peak hour time frame exhibits a distribution skewed toward higher values (right), indicating sporadic instances of unusually high passenger numbers.

  • Such patterns may point to congestion at the Central Business District (CBD), at bus stops near expressways where passengers frequently transfer between services, or at bus interchanges.

  • Subsequent investigation could ascertain if these occurrences are geographically clustered, helping to identify probable congestion focal points.

Next we explore if the distribution changes across different months (i.e. August to October 2023)

Code
# Generate violin plots with jittered points for bus passenger data
origind_scatter <- origind_agg %>%
  ggplot(aes(x = period, y = num_trips, fill = period, color = period)) +
  geom_violin(position = position_nudge(x = 0, y = .2), alpha = 0.5) +  # Create violin plots
  geom_point(aes(y = num_trips, color = period), 
             position = position_jitter(height = .15), size = .5, alpha = 0.8) +  # Add jittered points
  facet_wrap(~YEAR_MONTH, nrow = 1) +  # Facet by YEAR_MONTH for comparison
  scale_y_continuous(labels = scales::number_format(accuracy = 1), 
                     breaks = scales::pretty_breaks(n = 3)) +  # Adjust y-axis scale
  labs(title = "Passenger Volume Distribution: Aug - Oct 2023") +  # Add a descriptive title
  theme(
    axis.title.y = element_blank(),
    axis.title.x = element_blank(),
    axis.ticks.y = element_blank(),
    axis.ticks.x = element_blank(),
    axis.text.x = element_blank()  # Remove x-axis labels for clarity
  )  

# Display the plot
origind_scatter

The plots indicate a consistent trip distribution across the three months, suggesting the presence of bus stops with persistently high passenger volumes, particularly during weekday evenings. This consistency hints at potential congestion hot spots, particularly if these stops are geographically clustered. Consequently, the geospatial analysis that follows focuses on the September 2023 data to identify these areas.

4 Spatial Polygon preparation

4.1 Month Data Extraction

We extract the September 2023 data.

Code
# Isolate September 2023 data for focused analysis
origind09 <- origind_agg %>%
  filter(YEAR_MONTH == "2023-09") %>%  # Targeting September 2023
  pivot_wider(  # Transform data to wide format for easier analysis
    names_from = period,
    values_from = num_trips
  ) %>%
  select(-YEAR_MONTH)  # Exclude YEAR_MONTH column from the final output

# Create a datatable visualization for the transformed data
DT::datatable(origind09,
              options = list(pageLength = 5, autoWidth = TRUE),
              rownames = FALSE)  # Configure table display options

4.2 Left Join to create dataframe

Code
origind09_sf <- left_join(busstop, origind09, by = "ORIGIN_PT_CODE")

origind09_sf
Simple feature collection with 5161 features and 6 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 3970.122 ymin: 26482.1 xmax: 48284.56 ymax: 52983.82
Projected CRS: SVY21 / Singapore TM
First 10 features:
   ORIGIN_PT_CODE             LOC_DESC Weekday evening peak
1           22069   OPP CEVA LOGISTICS                   44
2           32071         AFT TRACK 13                   NA
3           44331              BLK 239                 1524
4           96081 GRACE INDEPENDENT CH                  274
5           11561              BLK 166                  134
6           66191         AFT CORFE PL                  325
7           23389              PEC LTD                  388
8           54411              BLK 527                 1204
9           28531              BLK 536                 3529
10          96139              BLK 148                 1152
   Weekday morning peak Weekend/PH evening peak Weekend/PH morning peak
1                    13                       2                      20
2                    NA                       1                      NA
3                  1701                     630                     707
4                   286                      99                     114
5                   152                      79                      77
6                   386                     193                     234
7                    49                      81                      18
8                  2764                     649                    1612
9                  8000                    2222                    2373
10                 4406                     735                     997
                    geometry
1  POINT (13576.31 32883.65)
2  POINT (13228.59 44206.38)
3   POINT (21045.1 40242.08)
4  POINT (41603.76 35413.11)
5  POINT (24568.74 30391.85)
6  POINT (30951.58 38079.61)
7    POINT (12476.9 32211.6)
8  POINT (30329.45 39373.92)
9  POINT (14993.31 36905.61)
10  POINT (41642.81 36513.9)

4.3 Creating Hexagon Spatial Grid

origind09_sf, being a dataframe of spatial points, isn’t ideal for spatial autocorrelation studies, which require ‘areas’ depicted as polygons. The code that follows reorganizes these bus stop points into a hexagonal lattice for spatial analysis.

Code
origind09_hx <- st_make_grid(
    origind09_sf,
    cellsize = 500,
    square = FALSE
  ) %>%
  st_sf() %>%
  rowid_to_column("hex_id")

The code above does the following:

  • The cellsize parameter is measured in the units of the origind09_sf dataframe’s coordinate system. Since origind09_sf uses the EPSG:3414 projection, whose units are meters, this cellsize sets the distance between opposite edges of each hexagon, here 500 meters.

  • Each hexagon is given a unique hex_id as the primary key.
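Assuming cellsize is the flat-to-flat distance of each hexagon (as the sf documentation describes for square = FALSE grids), the ground area covered by each cell follows directly. A quick base-R check:

```r
# Area of a regular hexagon with flat-to-flat width w: sqrt(3) / 2 * w^2
hex_area_m2 <- function(w) sqrt(3) / 2 * w^2

# Each 500 m cell covers roughly 216,506 m^2, i.e. about 0.22 km^2
hex_area_m2(500)
hex_area_m2(500) / 1e6
```

This gives a sense of the spatial resolution at which bus stop attributes are aggregated below.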

4.4 Generate Attribute Dataframe using Hexagon Identifiers

The hexagonal grid is designed to contain multiple bus stops, with each hexagon’s hex_id acting as the key for compiling data. The following steps detail the process for structuring attributes based on hex_id:

  • Apply st_join() with the st_within join condition to match bus stop points to their respective hexagons.

  • Use st_set_geometry(NULL) to remove the spatial aspect, shifting the focus to attribute consolidation.

  • Implement group_by() to consolidate data under a distinct hex_id.

  • Deploy summarise() to tally the number of bus stops and to sum up trips for each peak period by hexagon.

  • Convert any NA entries to 0 with replace_na() to clean the dataset (a precaution, although the earlier checks already confirmed there were no nulls).

Code
# Joining bus stops with hexagonal areas and summarizing data
origind09_stops <- origind09_sf %>%
  st_join(origind09_hx, join = st_within) %>%  # Associate bus stops with hexagons
  group_by(hex_id) %>%  # Group by hexagon ID
  summarise(
    total_busstops = n(),  # Count bus stops per hexagon
    busstop_ids = str_c(ORIGIN_PT_CODE, collapse = ", "),  # Combine bus stop codes
        loc_descriptions = str_c(LOC_DESC, collapse = "; "),  # Combine location descriptions
    peak_weekday_morning = sum(`Weekday morning peak`),  # Sum for weekday morning peak
    peak_weekday_evening = sum(`Weekday evening peak`),  # Sum for weekday evening peak
    peak_weekend_morning = sum(`Weekend/PH morning peak`),  # Sum for weekend morning peak
    peak_weekend_evening = sum(`Weekend/PH evening peak`)   # Sum for weekend evening peak
  ) %>%
  st_set_geometry(NULL) %>%  # Remove spatial component
  replace_na(list(peak_weekday_morning = 0, peak_weekday_evening = 0,  # Replace NA with 0
                  peak_weekend_morning = 0, peak_weekend_evening = 0)) %>%
  ungroup()  # Remove grouping

# Display the processed data
origind09_stops
# A tibble: 1,524 × 8
   hex_id total_busstops busstop_ids  loc_descriptions      peak_weekday_morning
    <int>          <int> <chr>        <chr>                                <dbl>
 1     34              1 25059        AFT TUAS STH BLVD                       91
 2     65              1 25751        BEF TUAS STH AVE 14                     41
 3     99              1 26379        YONG NAM                                50
 4    127              1 25761        OPP REC S'PORE                         129
 5    129              2 25719, 26389 THE INDEX; BEF TUAS …                 1104
 6    130              1 26369        SEE HUP SENG                            34
 7    131              1 26299        BEF TUAS STH AVE 6                      41
 8    159              1 25741        HALLIBURTON                             66
 9    160              1 25711        OPP THE INDEX                          204
10    161              2 25701, 25709 GLAXOSMITHKLINE; EDG…                  842
# ℹ 1,514 more rows
# ℹ 3 more variables: peak_weekday_evening <dbl>, peak_weekend_morning <dbl>,
#   peak_weekend_evening <dbl>

4.5 Create Spatial Polygon dataframe

Code
origind09_hx <- origind09_hx %>%
  left_join(origind09_stops,
            by = "hex_id"
  ) %>%
  replace(is.na(.), 0)

bus_traffic_09 <- filter(origind09_hx,
                       total_busstops > 0)

5 Visualization and Sense Making

This section visualizes and makes sense of the data before embarking on more complex methodologies such as Local Indicators of Spatial Association (LISA).

Code
# Switch to tmap interactive viewing mode
tmap_mode("view")

# Creating an interactive map of bus traffic
bus_traffic_09_map <- tm_basemap("CartoDB.Positron") +
  tm_shape(bus_traffic_09) +  # Connecting tm_shape directly to tm_fill
  tm_fill(
    col = "total_busstops",  # Color based on the number of bus stops
    palette = "YlGnBu",  # Using the YlGnBu color palette
    style = "cont",  # Continuous color style
    id = "loc_descriptions",  # Identify hexagons by their location descriptions
    popup.vars = c("Bus Stops Count" = "total_busstops",
                   "Stop Codes" = "busstop_ids"),  # Popup information
    title = "Number of Bus Stops"  # Title for the color legend
  ) +
  tm_layout(
    legend.show = FALSE)  

# Display the interactive map
bus_traffic_09_map

The map indicates that the central area, which includes business districts such as the Central Business District and one-north, as well as highly populated residential zones like Sengkang, has a higher concentration of bus stops. These areas are known for high human traffic, which necessitates an extensive public transportation network to serve the daily travel needs of their residents and workers.

In contrast, less frequented parts of the island, such as Lim Chu Kang, an area marked for agricultural activities, appear in lighter shades, highlighting a sparser presence of bus stops. These less urbanized or more industrial regions typically see lower human traffic.

The distribution of bus stops seen in the map reflects Singapore’s Land Transport Authority (LTA)’s strategic approach to providing an efficient bus service, focusing on access, connectivity, and the efficient use of resources, especially in areas with high human traffic where public transport demand is greater.

5.1 Peak Period Analysis

Code
summary(origind09_sf)
 ORIGIN_PT_CODE                LOC_DESC    Weekday evening peak
 11009  :   2   BLK 1              :   6   Min.   :     1      
 22501  :   2   BLK 25             :   6   1st Qu.:   639      
 43709  :   2   AFT YIO CHU KANG RD:   5   Median :  1797      
 47201  :   2   BLK 101            :   5   Mean   :  4464      
 51071  :   2   BLK 109            :   5   3rd Qu.:  4080      
 52059  :   2   (Other)            :5126   Max.   :481495      
 (Other):5149   NA's               :   8   NA's   :170         
 Weekday morning peak Weekend/PH evening peak Weekend/PH morning peak
 Min.   :     1       Min.   :     1.0        Min.   :     1         
 1st Qu.:   414       1st Qu.:   228.5        1st Qu.:   207         
 Median :  1946       Median :   715.0        Median :   742         
 Mean   :  4560       Mean   :  1707.7        Mean   :  1689         
 3rd Qu.:  5430       3rd Qu.:  1666.0        3rd Qu.:  1889         
 Max.   :328545       Max.   :158693.0        Max.   :112330         
 NA's   :176          NA's   :202             NA's   :180            
          geometry   
 POINT        :5161  
 epsg:3414    :   0  
 +proj=tmer...:   0  
                     
                     
                     
                     

Reinforcing the statements above, the summary confirms that the data is right-skewed. The breaks for the following maps are therefore adjusted to better illustrate comparisons between peak-hour time frames.
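One systematic alternative to hand-picking breaks is to derive them from quantiles, which allocates classes where a skewed variable actually concentrates. A base-R sketch on toy values (not the map variables themselves):

```r
# Toy right-skewed traffic counts (illustrative only)
traffic <- c(rep(100, 50), rep(1000, 30), rep(5000, 15), rep(50000, 4), 300000)

# Equal-width breaks would leave almost everything in the first bin;
# quantile-based breaks follow the distribution instead
qbreaks <- unique(quantile(traffic, probs = seq(0, 1, by = 0.2)))
qbreaks
```

tmap offers this directly via style = "quantile" in tm_fill(); the maps below use manual breaks instead so that all peak-period panels share one comparable scale.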

Code
peak_weekday_morning_map <- tm_basemap("CartoDB.Positron") +
  tm_shape(bus_traffic_09) +
  tm_fill(
    col = "peak_weekday_morning",
    palette = "YlGnBu",
    title = "Peak Weekday Morning Traffic",
    id = "loc_descriptions",
    showNA = FALSE,
    alpha = 0.6,
    breaks = c(0, 500, 1000, 2000, 3000, 5000, 10000, 50000, 100000, 300000, 550000)
  ) +
  tm_borders() +
  tm_layout(
    legend.show = TRUE,
    legend.position = c("right", "top"),
    legend.width = 0.2,
    legend.height = 0.5
  )

peak_weekday_morning_map
Code
peak_weekday_evening_map <- tm_basemap("CartoDB.Positron") +
  tm_shape(bus_traffic_09) +
  tm_fill(
    col = "peak_weekday_evening",
    palette = "YlOrRd",
    title = "Peak Weekday Evening Traffic",
    id = "loc_descriptions",
    showNA = FALSE,
    alpha = 0.6,
    breaks = c(0, 500, 1000, 2000, 3000, 5000, 10000, 50000, 100000, 300000, 550000)
  ) +
  tm_borders() +
  tm_layout(
    legend.show = TRUE,
    legend.position = c("right", "top"),
    legend.width = 0.2,
    legend.height = 0.5
  )

peak_weekday_evening_map
Code
peak_weekend_morning_map <- tm_basemap("CartoDB.Positron") +
  tm_shape(bus_traffic_09) +
  tm_fill(
    col = "peak_weekend_morning",
    palette = "RdPu",
    title = "Peak Weekend Morning Traffic",
    id = "loc_descriptions",
    showNA = FALSE,
    alpha = 0.6,
    breaks = c(0, 500, 1000, 2000, 3000, 5000, 10000, 50000, 100000, 300000, 550000)
  ) +
  tm_borders() +
  tm_layout(
    legend.show = TRUE,
    legend.position = c("right", "top"),
    legend.width = 0.2,
    legend.height = 0.5
  )

peak_weekend_morning_map
Code
# Creating a static map of peak weekend evening bus traffic
peak_weekend_evening_map <- tm_basemap("CartoDB.Positron") +
  tm_shape(bus_traffic_09) +
  tm_fill(
    col = "peak_weekend_evening",
    palette = "Blues",
    title = "Peak Weekend Evening Traffic",
    id = "loc_descriptions",
    showNA = FALSE,
    alpha = 0.6,
    breaks = c(0, 500, 1000, 2000, 3000, 5000, 10000, 50000, 100000, 300000, 550000)
  ) +
  tm_borders() +
  tm_layout(legend.show = TRUE)

# Display the map
peak_weekend_evening_map

Unsurprisingly, weekday trip volumes surpass those of weekends/holidays, most plausibly due to the working crowd. Nonetheless, regardless of the peak-hour time frame, the areas that see higher human traffic remain unchanged, highlighting the efficacy of the Singapore bus network in serving commuters’ needs.

Code
library(ggplot2)
library(gridExtra)

# Function to create a scatter plot
create_scatter_plot <- function(data, y_column_name, title, color = "blue") {
  ggplot(data, aes(x = total_busstops, y = .data[[y_column_name]])) +
    geom_point(alpha = 0.7, color = color) +
    scale_y_continuous(limits = c(0, 500000),  # Single y scale: limits and breaks together
                       breaks = scales::pretty_breaks(n = 6)) +
    labs(title = title, x = "Number of Bus Stops", y = "") +
    theme(panel.grid = element_blank())
}

# Creating individual scatter plots
morning_peak_weekday <- create_scatter_plot(bus_traffic_09, "peak_weekday_morning", "Weekday Morning Peak", "blue")
evening_peak_weekday <- create_scatter_plot(bus_traffic_09, "peak_weekday_evening", "Weekday Evening Peak", "blue")
morning_peak_weekend <- create_scatter_plot(bus_traffic_09, "peak_weekend_morning", "Weekend/PH Morning Peak", "green")
evening_peak_weekend <- create_scatter_plot(bus_traffic_09, "peak_weekend_evening", "Weekend/PH Evening Peak", "green")

# Combining the plots using grid.arrange
combined_plot <- grid.arrange(morning_peak_weekday, evening_peak_weekday, morning_peak_weekend, evening_peak_weekend, ncol = 2, nrow = 2)

The volume of passenger trips appears greatest in areas with 4 to 8 bus stops. In contrast, the areas with the most bus stops (9 to 11) did not experience the highest passenger traffic during any peak period.

This suggests there may be an ideal number of bus stops within each 500 m-wide area for alleviating congestion; one possible measure would be to review bus stop placement so that adding further stops does not yield diminishing returns.
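As a rough numerical check on this relationship, one could compute Kendall’s tau (the Kendall package is already loaded) between the number of bus stops and each peak-period trip volume. This is an exploratory sketch, assuming bus_traffic_09 contains the columns used in the plots above:

```r
# Exploratory sketch: rank correlation between bus-stop count and peak trips.
# Assumes bus_traffic_09 has total_busstops and the four peak columns.
peak_cols <- c("peak_weekday_morning", "peak_weekday_evening",
               "peak_weekend_morning", "peak_weekend_evening")

sapply(peak_cols, function(col) {
  Kendall::Kendall(bus_traffic_09$total_busstops,
                   bus_traffic_09[[col]])$tau
})
```

A tau that rises and then flattens (or falls) across stop counts would be consistent with the diminishing-returns pattern seen in the scatter plots.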

6 Local Indicators of Spatial Association

In preparing for spatial autocorrelation analysis, we first establish a spatial weights matrix to map the interconnections between hexagonal spatial units by their distance. Key to this setup is ensuring that each unit has at least one, but not all, other units as neighbors to reflect diverse spatial relationships accurately.

Given our dataset’s skewed distribution, we’ve chosen to assign 10 neighbors per feature, exceeding the suggested minimum of eight. This approach enhances the robustness of our spatial connectivity analysis.

Our study area, marked by unevenly distributed bus stops and zones with low residential or business density, leads us to prefer distance-based methods over contiguity-based ones. This choice aligns better with our data’s spatial characteristics.

We adopt the Adaptive Distance-Based method, with a fixed number of neighbors, to accommodate our dataset’s right-skewed distribution. This method ensures each hexagon connects with exactly 10 neighbors. We employ the st_knn function to identify neighbors and st_weights to create row-standardized spatial weights, setting a solid foundation for our spatial autocorrelation analysis.
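With k = 10 neighbours and row standardisation (style = 'W'), every neighbour of a hexagon receives an equal weight, which is why the weight-matrix preview that follows shows rows of 0.1:

```latex
w_{ij} =
\begin{cases}
  \dfrac{1}{k} = 0.1, & j \in N_{10}(i) \\[4pt]
  0, & \text{otherwise}
\end{cases}
```

Row standardisation ensures each hexagon’s weights sum to 1, so the spatial lag is simply the mean trip count of its 10 nearest neighbours.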

6.1 Global Spatial Association with Global Moran’s I

Code
# Adjusting bus_traffic_09 data with spatial weights
weightmat_all <- bus_traffic_09 %>%
  mutate(
    knn = st_knn(geometry, k = 10),  # Calculating 10 nearest neighbors
    weight_type = st_weights(knn, style = 'W'),  # Generating weights
    .before = 1  # Placing the new columns at the beginning
  )


# check the output
kable(head(weightmat_all))
knn weight_type hex_id total_busstops busstop_ids loc_descriptions peak_weekday_morning peak_weekday_evening peak_weekend_morning peak_weekend_evening geometry
2, 4, 5, 8, 9, 12, 16, 22, 23, 38 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1 34 1 25059 AFT TUAS STH BLVD 91 348 0 96 POLYGON ((3970.122 27925.48...
1, 4, 5, 8, 9, 12, 16, 22, 23, 38 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1 65 1 25751 BEF TUAS STH AVE 14 41 131 38 40 POLYGON ((4220.122 28358.49...
5, 6, 9, 10, 13, 14, 16, 17, 24, 25 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1 99 1 26379 YONG NAM 50 278 78 84 POLYGON ((4470.122 30523.55...
1, 2, 8, 9, 12, 16, 22, 23, 38, 39 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1 127 1 25761 OPP REC S'PORE 129 1689 246 454 POLYGON ((4720.122 28358.49...
3, 6, 9, 10, 12, 13, 14, 16, 17, 24 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1 129 2 25719, 26389 THE INDEX; BEF TUAS STH AVE 5 1104 2608 503 646 POLYGON ((4720.122 30090.54...
3, 5, 7, 9, 10, 13, 14, 17, 18, 25 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1 130 1 26369 SEE HUP SENG 34 185 117 27 POLYGON ((4720.122 30956.57...
Code
# Ensure consistent results
set.seed(5555)

# Assuming 'bus_traffic_09' is a spatial dataset with a 'geometry' column
# List of columns for peak traffic analysis
bus_traffic_columns <- c("peak_weekday_morning", "peak_weekday_evening", "peak_weekend_morning", "peak_weekend_evening")

# Function to calculate global Moran's I
calculate_moran_i <- function(dataset, column_name, neighbors) {
  # Validate that the dataset is properly structured as a spatial data frame
  if (!("geometry" %in% names(dataset))) {
    stop("The dataset does not have a geometry column.")
  }

  # Validate that the column_name exists in the dataset
  if (!(column_name %in% names(dataset))) {
    stop(paste("Column", column_name, "not found in the dataset."))
  }

  # Establishing spatial relationships
  neighbors_list <- st_knn(dataset$geometry, k = neighbors)
  weights <- st_weights(neighbors_list, style = 'W')

  # Executing Moran's I calculation
  moran_test_result <- global_moran_perm(dataset[[column_name]], neighbors_list, weights, nsim = 99)
  
  # Organizing the output
  output <- list(
    column = column_name,
    moran_i = moran_test_result
  )
  
  return(output)
}

# Running the Moran's I calculation for each traffic time slot
moran_i_results <- lapply(bus_traffic_columns, function(column) calculate_moran_i(bus_traffic_09, column, 10))

# Displaying the calculated results
print(moran_i_results)
[[1]]
[[1]]$column
[1] "peak_weekday_morning"

[[1]]$moran_i

    Monte-Carlo simulation of Moran I

data:  x 
weights: listw  
number of simulations + 1: 100 

statistic = 0.21341, observed rank = 100, p-value < 2.2e-16
alternative hypothesis: two.sided



[[2]]
[[2]]$column
[1] "peak_weekday_evening"

[[2]]$moran_i

    Monte-Carlo simulation of Moran I

data:  x 
weights: listw  
number of simulations + 1: 100 

statistic = 0.061867, observed rank = 100, p-value < 2.2e-16
alternative hypothesis: two.sided



[[3]]
[[3]]$column
[1] "peak_weekend_morning"

[[3]]$moran_i

    Monte-Carlo simulation of Moran I

data:  x 
weights: listw  
number of simulations + 1: 100 

statistic = 0.15841, observed rank = 100, p-value < 2.2e-16
alternative hypothesis: two.sided



[[4]]
[[4]]$column
[1] "peak_weekend_evening"

[[4]]$moran_i

    Monte-Carlo simulation of Moran I

data:  x 
weights: listw  
number of simulations + 1: 100 

statistic = 0.11109, observed rank = 100, p-value < 2.2e-16
alternative hypothesis: two.sided

For each of the four time slots, the p-value associated with the Global Moran’s I test is less than 0.05, so we reject the null hypothesis that the spatial pattern is random.

Additionally, the positive values of the Moran’s I statistics imply a tendency towards clustering in the spatial distribution across all time intervals.
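For reference, the statistic being permuted in each test is the Global Moran’s I:

```latex
I = \frac{n}{\sum_i \sum_j w_{ij}}
    \cdot
    \frac{\sum_i \sum_j w_{ij}\,(x_i - \bar{x})(x_j - \bar{x})}
         {\sum_i (x_i - \bar{x})^2}
```

where \(x_i\) is the trip count in hexagon \(i\) and \(w_{ij}\) are the row-standardised weights. Values well above the mean of the permutation distribution, as observed here, indicate positive spatial autocorrelation (clustering).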

6.2 LISA assessment

Investigating spatial patterns at a finer level, we use Local Moran’s I to identify specific areas with strong or weak spatial association. The method categorises regions by clustering type, such as high-high or low-low, revealing areas of similar or contrasting characteristics. We apply the local_moran() function from the sfdep package to passenger trips at the hexagon level, supplying the k-nearest-neighbour weights computed earlier and using Monte Carlo permutations (nsim = 99) to derive pseudo p-values.
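In standardised form, each hexagon’s local statistic decomposes the global Moran’s I:

```latex
I_i = z_i \sum_j w_{ij}\, z_j,
\qquad
z_i = \frac{x_i - \bar{x}}{\sigma_x}
```

A positive \(I_i\) flags a cluster (high-high or low-low, depending on the sign of \(z_i\)), while a negative \(I_i\) flags an outlier (high-low or low-high).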

Code
# Create a function to perform local Moran's I analysis
get_lisa <- function(weightmat_all, bus_traffic_column, k) {
  # Check if the bus_traffic_column is valid
  if (nchar(bus_traffic_column) == 0) {
    stop("The column name provided is empty.")
  }

  # Compute spatial weights
  nb <- st_knn(weightmat_all$geometry, k = k)
  wt <- st_weights(nb, style = 'W')

  # Perform local Moran's I analysis and create a new data frame
  result <- weightmat_all %>% 
    mutate(
      local_moran = local_moran(.data[[bus_traffic_column]], nb, wt, nsim = 99),
      .before = 1
    ) %>%
    unnest(local_moran)
  
  return(result)
}

# Assuming bus_traffic_columns is a vector of column names
bus_traffic_columns <- c("peak_weekday_morning", "peak_weekday_evening", "peak_weekend_morning", "peak_weekend_evening")

# Initialize a list to store results for each bus_traffic_column
lisa_results <- list()

# Apply the function for each bus traffic time interval and store results in the list
for (btc in bus_traffic_columns) {
  lisa_results[[btc]] <- get_lisa(weightmat_all, btc, k = 10)
  
  # Remove columns that don't belong to the specific time interval
  unwanted_columns <- setdiff(bus_traffic_columns, btc)
  lisa_results[[btc]] <- lisa_results[[btc]][, !(names(lisa_results[[btc]]) %in% unwanted_columns)]
}

# Show sample output in an interactive table
datatable(lisa_results[["peak_weekday_morning"]], 
          class = 'cell-border stripe', 
          options = list(pageLength = 5))

We now use the tmap package to generate choropleth maps displaying Local Moran’s I values for the four time intervals, highlighting only values significant at the 5% level. The code below filters the significant Local Moran’s I values before mapping.

Code
get_sig_lmi_map <- function(lisa_table, title) {
  
  sig_lisa_table <- lisa_table %>%
    filter(p_ii_sim < 0.05)
  
  result <- tm_shape(lisa_table) +
    tm_polygons() +
    tm_borders(alpha = 0.5) +
    tm_shape(sig_lisa_table) +
    tm_fill("ii", palette = "Reds") + 
    tm_borders(alpha = 0.4) +
    tm_layout(
      main.title = title,
      main.title.size = 1.3
    )
  
  return(result)
  
}
# Applying the function to create maps for different peak intervals

sig_lmi_1 <- get_sig_lmi_map(lisa_results[["peak_weekday_morning"]], "Weekday Morning" )
sig_lmi_2 <- get_sig_lmi_map(lisa_results[["peak_weekday_evening"]], "Weekday Evening" )
sig_lmi_3 <- get_sig_lmi_map(lisa_results[["peak_weekend_morning"]], "Weekend Morning" )
sig_lmi_4 <- get_sig_lmi_map(lisa_results[["peak_weekend_evening"]], "Weekend Evening" )

tmap_mode('view')

tmap_arrange(
  sig_lmi_1,
  sig_lmi_2,
  sig_lmi_3,
  sig_lmi_4,
  asp = 2,
  nrow = 2,
  ncol = 2
)

We now zoom in on the LISA cluster categories that are significant at α = 0.05.

Code
# Function for constructing thematic maps based on significant Local Moran's I data
generate_thematic_lisa_maps <- function(lisa_data_set, chart_title) {
  
  # Filtering for significant Local Moran's I values
  significant_lisa_data <- lisa_data_set %>%
    filter(p_ii_sim  < 0.05)
  
  # Building the map visualization
  thematic_map_output <- tm_shape(lisa_data_set) +
        tm_polygons(alpha = 0.5) +
        tm_borders(alpha = 0.5) +
        tm_shape(significant_lisa_data) +
        tm_fill("median",  
                id = "loc_descriptions",  
                palette = c("deepskyblue", "salmon", "green", "darkred"),
                alpha = 0.7) +
        tm_borders(alpha = 0.4) +
        tm_layout(
            main.title = chart_title,
            main.title.size = 1.5,
            legend.position = c("left", "top")
        )

    return(thematic_map_output)
}

# Creating maps for different time intervals
lisa_map_morning_peak <- generate_thematic_lisa_maps(lisa_results[["peak_weekday_morning"]], "Significant LISA - Weekday Morning")
lisa_map_evening_peak <- generate_thematic_lisa_maps(lisa_results[["peak_weekday_evening"]], "Significant LISA - Weekday Evening")
lisa_map_weekend_morning <- generate_thematic_lisa_maps(lisa_results[["peak_weekend_morning"]], "Significant LISA - Weekend Morning")
lisa_map_weekend_evening <- generate_thematic_lisa_maps(lisa_results[["peak_weekend_evening"]], "Significant LISA - Weekend Evening")
Code
tmap_mode('plot')
lisa_map_morning_peak

Code
tmap_mode('plot')
lisa_map_evening_peak

Code
tmap_mode('plot')
lisa_map_weekend_morning

Code
tmap_mode('plot')
lisa_map_weekend_evening

  • Deep Sky Blue Areas (Low-Low Clusters): These zones are characterized by fewer trips at bus stops, which are also surrounded by other areas with low trip frequencies, forming a cluster of less busy locations.

  • Green Areas (Low-High Outliers): These areas show unique patterns where bus stops have fewer trips in contrast to neighboring areas, indicating isolated spots of lower activity amidst busier surroundings.

  • Salmon Areas (High-Low Outliers): These regions are marked by bus stops with notably higher trip counts than their neighboring areas, highlighting them as exceptional spots of increased activity within less active zones.

  • Dark Red Areas (High-High Clusters): Here, bus stops see a higher volume of trips, and are in proximity to areas with similarly high activity, suggesting a concentration of busy locations.

6.3 Findings

High-High

  1. Comparing weekday morning and evening, the high-high clusters generally appear in the same areas (business districts/residential areas); however, the clusters noticeably shrink in size, which could be due to the following reasons:

    1. Morning Peak: Bus use spikes in the morning due to simultaneous commutes to work and schools, creating distinct high-usage clusters.

    2. Evening Spread: Evening could be seeing a staggered exodus from business hubs to varied locations, diffusing the earlier high-usage patterns.

    3. Shift in Activities: Daytime high traffic in business and industrial zones transitions to residential and leisure areas at night, altering usage patterns.

    4. Varied Destinations: Evening trips are more dispersed as people head to homes, dining, or leisure spots, reducing reliance on main transit hubs.

    5. Diverse Evening Travel: With more unpredictable routes and choices for non-commuting travel in the evening, the uniform high-usage clusters dissipate.

  2. Weekend mornings see concentrated bus travel to areas busy with shopping and leisure, forming high-high clusters. By evening, these clusters diminish as daytime activities wind down and commuters return home in a staggered fashion outside the peak hours.

Low-Low

  1. Bus trips are less frequent in the industrial western sectors of Singapore, indicating lower public transport usage, possibly due to a sparse population or alternative transport options. These patterns showcase the area’s distinct travel dynamics.

  2. Comparing the maps for weekday mornings and evenings or weekend mornings and evenings, one may notice changes in the size and distribution of low-low clusters. These changes can provide insights into how public transportation demand varies throughout the day and week.

  3. Notably, some areas, such as Woodlands Waterfront Park and West Coast Park, see their low-low clusters shrink during the weekday evening.

  4. As on weekdays, the weekend low-low clusters sit mainly in the industrial areas in the west, for both morning and evening.